LiDA: Language-Independent Data Augmentation for Text Classification

Abstract

Developing a high-performance text classification model in a low-resource language is challenging due to the lack of labeled data. Meanwhile, collecting large amounts of labeled data is cost-inefficient. One approach to increase the amount of labeled data is to create synthetic data using data augmentation techniques. However, most of the available data augmentation techniques work on English data and are highly language-dependent, as they operate at the word and sentence level, for example by replacing some words or paraphrasing a sentence. We present Language-independent Data Augmentation (LiDA), a technique that utilizes a multilingual language model to create synthetic data from the available training dataset. Unlike other methods, our method works at the sentence embedding level, independent of any particular language. We evaluated LiDA in three languages on various fractions of the dataset, and the results showed improved performance in both the LSTM and BERT models. Furthermore, we conducted an ablation study to determine the impact of the components of our method on the overall performance. The source code of LiDA is available at https://github.com/yest/LiDA .
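To make the idea of augmenting at the embedding level concrete, the following is a minimal sketch and not the LiDA implementation. It assumes each training sentence has already been mapped to a fixed-length vector (in practice by a multilingual sentence encoder; here the vectors are plain lists of floats). New training points are created by interpolating two vectors that share a label and adding a little Gaussian noise. The function names `interpolate`, `add_gaussian_noise`, and `augment`, and the parameters `sigma` and `n_new`, are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

Vector = List[float]


def interpolate(a: Sequence[float], b: Sequence[float], lam: float) -> Vector:
    """Convex combination lam * a + (1 - lam) * b of two embeddings."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def add_gaussian_noise(vec: Sequence[float], sigma: float, rng: random.Random) -> Vector:
    """Return a copy of vec with zero-mean Gaussian noise added per dimension."""
    return [x + rng.gauss(0.0, sigma) for x in vec]


def augment(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    n_new: int,
    sigma: float = 0.01,
    seed: int = 0,
) -> Tuple[List[Vector], List[Hashable]]:
    """Create n_new synthetic (embedding, label) pairs.

    Assumption: mixing two embeddings of the same class and perturbing the
    result slightly yields a plausible new example of that class. No text
    is touched, so nothing here depends on the language of the sentences.
    """
    rng = random.Random(seed)

    # Group the original embeddings by class label.
    by_label: Dict[Hashable, List[Sequence[float]]] = {}
    for vec, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vec)
    label_list = list(by_label)

    new_vecs: List[Vector] = []
    new_labels: List[Hashable] = []
    for _ in range(n_new):
        label = rng.choice(label_list)
        pool = by_label[label]
        a, b = rng.choice(pool), rng.choice(pool)
        mixed = interpolate(a, b, rng.random())
        new_vecs.append(add_gaussian_noise(mixed, sigma, rng))
        new_labels.append(label)
    return new_vecs, new_labels
```

The synthetic pairs would be appended to the original embeddings before training a classifier on top of them. Because only vectors are manipulated, the same routine applies regardless of which language produced the embeddings.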

Similar resources

Text Classification using Language-independent Pre-processing

A number of language-independent text pre-processing techniques, to support multi-class single-label text classification, are described and compared. A simple but effective statistical keyword identification approach is proposed, coupled with a number of phrase identification mechanisms. Experimental results are presented.

Language independent semantic kernels for short-text classification

Short-text classification is increasingly used in a wide range of applications. However, it still remains a challenging problem due to the insufficient nature of word occurrences in short-text documents, although some recently developed methods which exploit syntactic or semantic information have enhanced performance in short-text classification. The language-dependency problem, however, caused...

A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification

Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data into a data-mining-ready structure, where the most significant text features that serve to differentiate between text categories are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups: (i) linguistic, (ii) statistical, and (iii) hybr...

Data Augmentation for Plant Classification

Data augmentation plays a crucial role in increasing the number of training images, which often helps improve the classification performance of deep learning techniques for computer vision problems. In this paper, we employ the deep learning framework and determine the effects of several data-augmentation (DA) techniques for plant classification problems. For this, we use two convolutional neura...

Language sensitive text classification

It is a traditional belief that, in order to scale up to more effective retrieval and access methods, modern Information Retrieval has to take greater account of the text content. The modalities and techniques to meet these objectives are still under discussion. More empirical evidence is required to determine the suitable linguistic levels for modeling each IR subtask (e.g. information zoning, parsing, feat...

Journal

Journal title: IEEE Access

Year: 2023

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2023.3234019